Scientific Data

Springer Science and Business Media LLC

Preprints posted in the last 30 days, ranked by how well they match Scientific Data's content profile, based on 174 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit.

1
Hyperspectral imaging of Marchantia

Tan, G. Z. H.; Urano, D.

2026-05-29 plant biology 10.64898/2026.05.28.721262 medRxiv
Top 0.1%
32.0%

Hyperspectral imaging is an imaging technique that allows for acquisition of high-resolution spectral information beyond that of the visible spectrum. When applied to plants, it effectively enables non-invasive characterization of physiological status and has been widely used in agricultural settings. Marchantia is a model bryophyte species whose flat morphology and visually distinct stress-response phenotypes make it an ideal candidate for imaging studies. Here, we provide a comprehensive protocol for hyperspectral imaging of Marchantia plants, which encompasses hardware configuration, data acquisition, and computational processing. This protocol features a streamlined data processing pipeline hosted on a web-based development platform that automates 1) the segmentation of plant area into spatially distinct regions for localized analysis of intra-specimen physiological gradients, and 2) classification of plant pixels based on their spectral signatures. All results are exported as structured CSV files for ease of further analysis as desired by the user.

2
Nipoppy: A framework for standardizing neuroimaging studies to facilitate international derived-data sharing

Bhagwat, N.; Wang, M.; Dugre, M.; Pfarr, J.-K.; Dai, A.; Urchs, S.; McPherson, B.; Gau, R.; van Heese, E. M.; d'Angremont, E.; Laansma, M. A.; Prasad, S.; Sanz-Robinson, J.; Torabi, M.; Jahanpour, A.; Danyluik, M.; Joubert, A.; Macdonald, A.; Waller, L.; Stewart, A.; Joulot, M.; Dickie, E.; Devenyi, G. A.; Bouix, S.; Bollmann, S.; Jahanshad, N.; Thompson, P. M.; Burgos, N.; Chakravarty, M. M.; Halchenko, Y. O.; van der Werf, Y. D.; Poline, J.-B.

2026-05-21 bioinformatics 10.64898/2026.05.18.723593 medRxiv
Top 0.1%
28.7%

Neuroimaging data management and processing are tedious and error-prone, prompting reproducibility concerns. Globally, studies with heterogeneous infrastructure and governance policies lead to eclectic data processing and sharing, necessitating standardization of data workflows to ensure reusability and comparability of multi-centric datasets. The Nipoppy neuroinformatics framework facilitates such standardization by combining specification, protocol, and software to manage study-level data workflows. With its adoption, researchers can share standardized, derived datasets enabling efficient, reproducible, and inclusive research.

3
Opening a standardized, spatially contiguous biodiversity database collected over 40 years: Czech breeding bird atlases 1973-77, 1985-89, 2001-03, and 2014-17

Ortega-Solis, G. R.; Stastny, K.; Bejcek, V.; Telensky, T.; Mellado-Mansilla, D.; Zarybnicky, J.; Grattarola, F.; Zarybnicka, M.; Vermouzek, Z.; Vorisek, P.; Leroy, F.; Tietje, M.; Soria, C. D.; Padulosi, E.; Travnickova, E.; Wolke, F. J. R.; Keil, P.

2026-05-29 ecology 10.64898/2026.05.29.728651 medRxiv
Top 0.1%
28.4%

Motivation: High-quality biodiversity data with temporal replicates, produced using standardized fieldwork protocols, are rare yet essential for studying long-term biodiversity dynamics. Most available large-scale temporal data only date back one or two decades and/or originate from spatially discrete local observations. Here, we release spatially contiguous, systematically collected, and gridded occurrence data for breeding birds in Czechia, covering the periods 1973-1977, 1985-1989, 2001-2003, and 2014-2017. This database represents the monitoring of ca. 41% of European bird species over 40 years, and it is one of the longest-running nationwide bird-monitoring efforts in the world. We also complement the original data with geospatial metrics to characterize the sampling polygons and provide proxies of sampling effort. By making this dataset openly accessible, we aim to strengthen biodiversity change studies, citizen science, and ornithological research with long-term, highly curated records, backed by well-documented methods, and ready for integration with other datasets. Main Types of Variables Contained: A total of 286,302 breeding bird detections/non-detections per grid cell from 247 species (ca. 41% of the 596 species breeding in Europe). The fourth atlas also contains 9,471 timed species lists totaling 276,076 additional records collected with standardized effort and partially random spatial sampling on smaller squares dividing the original grid cells. Spatial Location and Grain: Czechia (total area of 78,871 km²) covered by a grid of 887 grid cells of 10 by 10 km for the period 1973-77, and 678 cells of 6 minutes latitude and 10 minutes longitude (~11.2 x 12 km) from 1985 onwards. The timed species lists were collected across 4,851 of 9,844 small squares (~2.8 x 3 km) that subdivide each original grid cell into 16 smaller polygons. Time Period and Grain: The sampling years were 1973-1977 (5 breeding seasons), 1985-1989 (5 breeding seasons), 2001-2003 (3 breeding seasons), and 2014-2017 (4 breeding seasons). Major Taxa and Level of Measurement: Birds (Aves). The breeding evidence per species and grid cell was classified following the European Breeding Birds Atlas 2. We provide species-level records matched to the HBW/BirdLife version 9 (2024). Format: The dataset is available for download from Zenodo and is provided as CSV files with fields standardized to Darwin Core, and a GeoPackage file containing all of the spatial grids used. The data are organized into separate files for records and sampling events, corresponding to each atlas. All data are licensed under CC-BY 4.0.

4
Digital Atlases to Unlock the Potential of Brain Biorepository Tissues for Interdisciplinary Research

Webster, J. M.; Shojaie, A.; Shen, Y. A.; Le, T.; Ragaglia, E.; Bogdani, M.; Kirkland, A.; Mac Donald, C.; Latimer, C. S.; Keene, C. D.; Grabowski, T. J.

2026-05-15 neuroscience 10.64898/2026.05.13.724753 medRxiv
Top 0.1%
26.8%

Human brain tissue preserved in biorepositories is foundational for the structural, cellular, and biomolecular research necessary for a mechanistic understanding of neurological diseases. Realizing the research potential of these valuable resources requires well-characterized research-relevant tissue that can be efficiently identified by investigators and incorporated into the conceptual and computational frameworks of interdisciplinary research. Several large-scale efforts to improve research reliability and reproducibility have sought to characterize and annotate the processes by which these samples are collected, yet limited progress has been made on standardizing spatial information for these samples. Biorepositories systematically collect brain tissue according to a brain sampling protocol (BSP) that differs between institutions, yet explicit spatial information regarding the samples may not be documented in standard operating procedures (SOPs). The anatomical location details available to investigators are inconsistent across biorepositories and typically lack sufficient anatomical precision to ensure correspondence with samples from other biorepositories or research-relevant brain regions specified by neuroimaging, functional, or disease-susceptibility criteria. Here, we introduce a pipeline for developing a Spatial Atlas for Mapping Protocol Locations of Ex vivo Samples (SAMPLES), which uses a neuroimaging framework to create a 3D representation of a BSP through a metrically precise digital instantiation of the procedures for brain extraction, segmentation, slicing, and sampling on a modern digital brain template. SAMPLES incorporates modern neuroinformatics conventions to create explicit 3D labels of BSP-defined samples that can be interactively visualized with freely available neuroimaging software. 
We illustrate the pipeline by developing an atlas for the protocol from the University of Washington BioRepository and Integrated Neuropathology laboratory (UW BRaIN SAMPLES). By providing an explicit, computable reference, SAMPLES atlases can support the efficient identification, referencing, and utilization of postmortem samples for interdisciplinary research. These capabilities enable biorepository workflows, data harmonization across biorepositories, and integration with antemortem neuroimaging.

5
An fMRI dataset of verbalized spontaneous thought with annotated transcripts and self-report trait measures

Zhang, M.; Liu, P. R.; Su, H.; Zhao, M.; Li, X.; Born, S.; Lee, Y.; Honey, C.; Chen, J.; Lee, H.

2026-05-12 neuroscience 10.64898/2026.05.12.724488 medRxiv
Top 0.1%
22.5%

Spontaneous thought is pervasive in everyday human cognition, yet datasets capturing its neural dynamics under minimally interrupted conditions remain limited. The current dataset was acquired from a think-aloud functional MRI experiment in which 118 participants continuously verbalized their spontaneous thoughts during 10-minute scanning sessions. The raw MRI data and verbal transcripts with sentence-level timestamps were previously released and analyzed in our prior study examining neural activity associated with thought transitions. Building on that release, we additionally provide preprocessed MRI data, speech transcriptions with word-level timestamps aligned to image acquisition, large language model-generated ratings of transcribed thoughts across emotional and sensory dimensions, and self-report survey measures assessing personality, mental health, and cognitive abilities. Validation analyses demonstrated activation in expected cortical regions associated with speech production and sensory content identified from transcript annotations, agreement between language model and human ratings, and adequate internal consistency of survey measures, supporting the dataset's overall quality. This dataset enables reuse for investigations of spontaneous thought, speech generation, and individual differences using naturalistic functional MRI data.

6
A dual EEG hyperscanning dataset of natural French face-to-face conversation

Yamasaki, H.; Blache, P.; Schön, D.

2026-05-15 neuroscience 10.64898/2026.05.13.724780 medRxiv
Top 0.1%
22.5%

Conversation is a fundamental human behaviour that requires rapid coordination between speaking, listening, and turn-taking, yet datasets capturing its neural dynamics in natural interaction remain scarce. Hyperscanning EEG is particularly valuable for this purpose because it records both interlocutors simultaneously, enabling the study of speaker-listener coupling, response timing, and dyadic coordination during live exchange. Here we present DUET (Dyadic Understanding, EEG and Turn-taking), a hyperscanning dataset for studying natural French face-to-face conversation. The dataset comprises recordings from 18 dyads, or 36 French-speaking adults, performing the Diapix collaborative spot-the-difference task across eight 4-minute face-to-face conversation blocks. EEG was recorded from all 36 participants; most recordings used 64 channels, with one pilot dyad recorded using 32 electrodes. The public release includes raw EEG recordings, precomputed ICA decompositions for reuse in downstream preprocessing, and various features derived from the audio and manually corrected transcripts.

7
Evaluation of MeaSeq: comprehensive analysis and reporting of measles virus whole genome sequences.

Hole, D. T.; Abdalla, A.; Zubach, V.; Pratt, M.; Van Driel, S.; Ashfaq, S.; Hiebert, J.; Duggan, A. T.

2026-05-14 bioinformatics 10.64898/2026.05.12.724559 medRxiv
Top 0.1%
21.7%

Although vaccine-preventable, measles virus (MeV) continues to pose a significant public health challenge, with a substantial resurgence of cases worldwide. As whole-genome sequencing (WGS) becomes increasingly affordable and routinely adopted in public health laboratories, reliable and accessible analysis of next-generation sequencing (NGS) data is critical for outbreak investigation and molecular surveillance. Here, we present MeaSeq, a fast, user-friendly, open-source bioinformatics pipeline for MeV analysis using Illumina or Oxford Nanopore Technologies (ONT) NGS data. MeaSeq performs quality control assessments, consensus genome assembly and variant detection, optional genotype-specific reference selection, Distinct Sequence Identifier (DSId) assignment via user-provided databases or hashing, sub-consensus variant visualization, genome quality assessment, and standardized HTML reporting. We compared the performance of MeaSeq on NGS data generated from multiple sequencing platforms and targeted enrichment strategies against gold-standard Sanger data, reference genomes, and publicly available comparative data. This validation demonstrates that MeaSeq provides an accurate, reproducible, and accessible solution for routine MeV WGS analysis, supporting genomic surveillance and outbreak response workflows in public health and research settings. Impact Statement: The recent surge in measles cases worldwide, causing several countries to lose their measles elimination status, underscores the urgent need for effective and accessible genomic surveillance. Our manuscript introduces MeaSeq, a comprehensive and open-source bioinformatics pipeline specifically designed for analyzing MeV NGS data. MeaSeq includes MeV-specific analyses such as genotype prediction from sequencing reads with optional genotype-specific reference selection; DSId assignment; quality control checks such as genome rule-of-six divisibility and gene CDS validation; subconsensus nucleotide analysis with mixed-site highlighting; and genomic plotting. By leveraging NGS technology, our pipeline can facilitate the identification of transmission chains and may provide critical insights into the dynamics of MeV outbreaks. This information is essential for public health officials and researchers to implement targeted interventions and optimize vaccine strategies. Additionally, the open-source nature of MeaSeq fosters collaboration and innovation within the scientific measles community along with providing access to a wider range of researchers. Data Summary: The MeaSeq pipeline code is available on GitHub (https://github.com/phac-nml/measeq). Comparative datasets of publicly available WGS data were accessed through the NCBI Sequence Read Archive under the following BioProjects: PRJNA869081 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA869081), PRJNA480551 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA480551), PRJNA1017431 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1017431), PRJNA1241325 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1241325), PRJNA1174053 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1174053), PRJNA1293457 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1293457), and PRJNA843031 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA843031). Whole-genome sequences were included in the validation analysis if they consisted of paired-end data (Illumina) and achieved ≥95% genome completeness following trimming of the 5' and 3' untranslated regions (UTRs). This criterion ensured sufficient genome coverage for robust validation while allowing for limited missing data arising from regions of low sequencing depth or amplicon dropout. A complete list of sequences included in the validation, along with their accession numbers, is provided in Supplementary Table 1.

8
HESTA: a curated and reusable database for the human early organogenesis spatiotemporal transcriptome atlas

Xu, Z.; Wang, W.; Li, Y.; Zhang, Y.; Chen, J.; Du, W.; Yang, T.

2026-05-29 bioinformatics 10.64898/2026.05.28.728391 medRxiv
Top 0.1%
14.3%

Background: Human organogenesis is orchestrated by precise spatiotemporal gene expression. Mapping these dynamic processes requires transcriptomic data that preserve native anatomical context across continuous developmental stages. Results: We present a spatiotemporal transcriptome database of human embryogenesis, profiling 77 sagittal sections from 13 euploid embryos (CS12-CS23) using Stereo-seq, yielding 14,744,703 bin50 spots. The atlas annotates 50 organs and maps 198 molecularly distinct substructures, complemented by 607,093 snRNA-seq cells. The database features a Spatial Exploration module for locating sections and visualizing spatial distributions of organs and substructures, and an Organ Atlas module for visualizing gene expression, regulon activities, and pathway enrichment at the single-organ level across embryos. Conclusions: This database provides an interactive resource to access spatial gene expression, substructures, and regulatory networks across 50 developing human organs, supporting further research into the mechanisms of human organogenesis.

9
Chromosome-level genome assemblies of the red algae Porphyra dioica and Porphyra linearis

Morcillo, J.; D'hondt, S.; Lipinska, A.; Bouckenooghe, S.; Noyen, L.; Van de Vloet, A.; Vranken, S.; Knoop, J.; Leliaert, F.; De Clerck, O.

2026-05-16 genomics 10.64898/2026.05.14.725108 medRxiv
Top 0.1%
14.3%

As one of the earliest-diverging multicellular eukaryotic lineages, the bladed Bangiales (Rhodophyta) possess a deep evolutionary history with a central role in the multi-billion-dollar global seaweed aquaculture industry. Although North Atlantic representatives are emerging candidates for regional mariculture, the scarcity of high-quality genomic resources for these taxa hinders both fundamental research and commercial optimization. To address this, we present the first chromosome-level genome assemblies for two native European species: Porphyra dioica (150.44 Mbp) and Porphyra linearis (95.22 Mbp). By integrating Oxford Nanopore Technologies (ONT) long-read sequencing with Hi-C proximity ligation, we generated highly contiguous nuclear genomes resolved into five chromosomes. Structural gene models were predicted through the BRAKER3 pipeline, identifying 12,548 and 10,382 protein-coding genes for P. dioica and P. linearis, respectively. Subsequent homology-based functional annotation characterized 57.4% and 59.8% of these predicted proteins. Supplemented by circularized organellar genomes, these reference genomes provide a critical framework for future research, enabling comparative studies of Atlantic-Pacific divergence and facilitating the development of selective breeding programs for sustainable European aquaculture.

10
Superpixel-ComBat multi-site harmonization of unpaired T1W MRI data in Huntington's disease

Coleman, A.; Chen, C.-L.; Hanson-Baiden, J.; Minhas, D. S.; Torbati, M. E.; Laymon, C. M.; Tabrizi, S. J.; Wild, E. J.; Tudorascu, D. L.; Scahill, R. I.; Byrne, L. M.

2026-05-29 neuroscience 10.64898/2026.05.26.727928 medRxiv
Top 0.2%
10.1%

Background: Pooling multi-site MRI data is essential for well-powered neuroimaging analyses, particularly in Huntington's disease (HD), where large cohorts are needed to study disease-stage heterogeneity and subtle progressive neuroanatomical change. However, scanner-related variability hinders direct data pooling, confounding image-level methods such as voxel-based morphometry (VBM). Superpixel-ComBat (SP-ComBat), a voxel-level image-harmonization framework, effectively removes scanner effects but depends on traveling-subject data that are rarely available retrospectively. We extend SP-ComBat to unpaired multi-site T1-weighted MRI by introducing a pseudo-pairing framework that leverages demographically matched controls across scanners as surrogate traveling subjects. Methods: Two pipelines were developed to estimate scanner effects under retrospective constraints: pipeline 1 used a small set of well-matched pseudo-pairs (n = 4) with bootstrap resampling to address scanners with limited sample sizes, while pipeline 2 used the recommended number of pseudo-pairs (n = 16) without resampling. Pseudo-pair images were parcellated into 3D superpixels, and ComBat was applied within clusters to estimate scanner-specific adjustments for native-space harmonization. Pipeline performance was assessed in a representative multi-study dataset comprising six scanners from three HD cohorts (HD-YAS, HD-CSF, TRACK-HD; N = 144) and replicated in the full multi-study dataset (FMD; N = 548). Results: Both pipelines improved image quality, aligned scanner-specific intensities, and preserved disease-related structural patterns. Pipeline 2 showed superior parameter re-estimation stability and was selected for the FMD. Harmonization eliminated systematic segmentation errors, enabled a single unified VBM pipeline across scanners, and increased sensitivity to HD-related voxel-wise atrophy. Conclusions: SP-ComBat was effectively adapted for harmonization of unpaired multi-site structural MRI, reducing scanner bias while preserving biological variability and supporting unified VBM analyses across scanners.

11
A pipeline for cell migration analysis in live-cell imaging data from human iPSC-derived forebrain assembloids.

Weidman, M. P.; Campbell, N. B.; Headings, C.; Chung, S.; Khan, M.; Kandukuri, A.; Lim, V.; Olubowale, G.; Kim, M.; Devor, A.; Zeldich, E.; Thunemann, M.

2026-05-18 neuroscience 10.64898/2026.05.17.725711 medRxiv
Top 0.2%
10.0%

During forebrain development, inhibitory interneurons and oligodendrocyte progenitor cells migrate long distances into the developing dorsal cortex. Human induced pluripotent stem cell-derived forebrain assembloids (FAs) provide direct experimental access to this migratory process in vitro. Using viral labeling to express yellow fluorescent protein (EYFP) and tandem-dimer tomato (tdTomato) driven by EF1 or SOX10 promoters, respectively, we tracked cells in FAs over 15-17h using spinning disk confocal microscopy. We developed an end-to-end processing pipeline for 4D volumetric imaging data, consisting of background subtraction and drift correction, manual cell coordinate tracking, and an analysis workflow to describe migratory cell behavior. Image preprocessing significantly improved data quality for subsequent manual tracking in datasets with heterogeneous labeling density and brightness. Trajectory analysis of 336 EYFP- and 337 tdTomato-labeled cells from twelve FAs indicates that most cells show super-diffusive directed motility. Our pipeline represents a key resource for cell tracking in FAs and similar three-dimensional platforms. This pipeline represents the first open tracking resource for iPSC-derived FAs and can be used as a ground-truth resource for the development of automated cell detection and tracking algorithms.

12
MICAFlow: Fast and Robust MRI Preprocessing Bridging Research Neuroimaging and Clinical Practice

Goodall-Halliwell, I.; DeKraker, J.; Bautin, P.; Mendelson, D.; Cabalo, D. G.; Sahlas, E.; Ngo, A.; Xie, K.; Lam, J.; Smith, M.; Hwang, Y.; Vavassori, L.; Milano, P.; Chen, J.; Dascal, A.; Ding, R.; Zhou, G.; Naish, M.; Mo, J.; Fadaie, F.; Cruces, R. R.; Bernhardt, B. C.

2026-05-29 bioinformatics 10.64898/2026.05.26.727725 medRxiv
Top 0.2%
9.9%

MICAFlow is a fully automated MRI preprocessing pipeline designed to translate advanced neuroimaging workflows from research into routine clinical practice. The pipeline emphasizes speed, robustness, and ease of use, focusing on structural and diffusion MRI. Key innovations include a Label-Augmented Modality-Agnostic Registration (LAMAReg) technique driven by deep learning segmentations for reliable cross-modal alignment, integration of state-of-the-art distortion corrections, and adherence to reproducible standards (Snakemake workflow, BIDSApp specifications). We describe the design of MICAFlow and evaluate its performance across heterogeneous datasets. First, accessibility: MICAFlow processes a multimodal MRI exam in minutes with clinically accessible hardware and without requiring GPU access, making it feasible for same-day clinical use. Second, registration accuracy: LAMAReg achieves cutting-edge multi-modal registration accuracy, yielding accurate alignment of diffusion MRI, FLAIR, and intra-subject T1-weighted images while remaining generally robust to common artifacts. Third, data reliability: Using identifiability, we show MICAFlow maintains consistent performance across diverse datasets, including subjects with pathology, and is closely comparable to contemporary pipelines. In sum, MICAFlow's combination of machine learning and efficient workflows produces research-grade data quality with clinical-grade speed. This work demonstrates that advanced MRI preprocessing can be done fast and robustly, helping close the gap between research neuroimaging and broad clinical application of quantitative MRI techniques. The source code for MICAFlow is available here: https://github.com/MICA-MNI/micaflow, and for LAMAReg here: https://github.com/MICA-MNI/LAMAReg.

13
Compatibility of National Food Composition Databases with USDA FoodData Central: A Seven-Country LLM-Based Analysis

Nakagawa, S.; Yamamoto, A.

2026-06-01 nutrition 10.64898/2026.05.23.26353942 medRxiv
Top 0.2%
8.7%

To evaluate the international interoperability of food composition databases, we assessed the compatibility of seven national food composition tables with USDA FoodData Central (FDC) using the LLM-based matching method reported previously (Nakagawa and Yamamoto, 2026). Databases from four English-speaking countries (Canada, United Kingdom, Australia, and New Zealand), South Korea, and Japan were compared with 8,158 USDA FDC entries (SR Legacy and Foundation Foods, excluding Survey/FNDDS). Match rates varied by country (62.0-89.7%) and food category. After excluding six USDA categories unsuitable for cross-national comparison, 45.2% of the remaining 6,290 entries were not matched by any country. Canada showed the highest concordance, reflecting shared North American food supply. Japan and South Korea showed similar low coverage for vegetables and spices. These findings suggest that while USDA FDC represents a practical foundation for a globally comprehensive food composition database given its breadth, systematic incorporation of country-specific foods and classification schemes will be necessary to achieve true international interoperability.

14
Automatic segmentation of choroid plexus using deep learning across neurodegenerative diagnoses in the multi-site COMPASS-ND Study

Singh, M.; Dabo, F.; Trigiani, L. J.; Araujo, D.; Narayanan, S.; Badhwar, A.

2026-05-18 radiology and imaging 10.64898/2026.05.14.26353194 medRxiv
Top 0.2%
8.6%

The choroid plexus (ChP) plays a central role in cerebrospinal fluid production, immune signaling, and metabolic clearance, and has emerged as a potential imaging biomarker of neurodegeneration. However, accurate and scalable quantification of ChP volume remains challenging due to its complex morphology and low contrast on conventional MRI. The Automatic Segmentation of Choroid Plexus (ASCHOPLEX), a deep learning framework originally trained on healthy controls and multiple sclerosis cohorts, has not been systematically evaluated in neurodegenerative populations. Using T1-weighted MRI from the multi-center COMPASS-ND study, we assessed standard ASCHOPLEX performance in cognitively unimpaired (CU), Alzheimer's disease (AD), and Parkinson's disease (PD) participants (N = 30), followed by fine-tuning using expert manual segmentations (N = 60). Segmentation accuracy was evaluated using Dice, Jaccard, precision, and recall. The fine-tuned model was then applied to a larger cohort (N = 277) to derive normalized ChP volumes, which were compared across diagnostic groups using linear regression models. Fine-tuning significantly improved segmentation accuracy across all metrics (Dice: 0.45 to 0.84; Jaccard: 0.32 to 0.73; all p < 0.0001), enabling robust ChP delineation across sites and conditions. In the full cohort, normalized ChP volume was significantly higher in AD compared with CU and PD (p < 0.0001), while PD did not differ from CU (p = 0.31). These findings demonstrate that dataset-specific adaptation is essential for deploying deep learning segmentation models in heterogeneous neuroimaging cohorts. The refined ASCHOPLEX framework enables scalable ChP quantification and supports its use as a structural imaging marker in neurodegenerative disease.
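The four overlap metrics named in this abstract can be illustrated with a minimal sketch. It assumes each segmentation is given as a set of labelled voxel indices; the function name `overlap_metrics` and the toy indices are hypothetical and are not taken from ASCHOPLEX or the COMPASS-ND data.

```python
def overlap_metrics(predicted, reference):
    """Compute Dice, Jaccard, precision, and recall for two voxel-index sets."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)  # voxels labelled in both masks
    fp = len(predicted - reference)  # voxels labelled only by the model
    fn = len(reference - predicted)  # reference voxels the model missed
    union = tp + fp + fn
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return {"dice": 1.0, "jaccard": 1.0, "precision": 1.0, "recall": 1.0}
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / union,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy example with hypothetical voxel indices: 3 shared, 1 extra, 2 missed.
metrics = overlap_metrics({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(metrics)
```

For a single pair of masks, Jaccard and Dice are linked by J = D / (2 - D). Cohort-level means of the two metrics need not satisfy that identity exactly, because each is averaged over subjects separately.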

15
VOGeo-Gaze: Calibration-Free, Geometry-Aware Deep Learning for Real-Time Gaze Tracking in Clinical Video-Oculography

Zhao, J.; Ahmadi, S.-A.; Decker, J.; Zwergal, A.; Eulenburg, P. z.; Flanagin, V. L.; Wuehr, M.

2026-05-29 health informatics 10.64898/2026.05.27.26354254 medRxiv
Top 0.2%
8.5%

Quantitative eye movement analysis is important for neurological diagnostics, yet existing video-oculography (VOG) systems typically require calibration, device-specific settings, or accurate gaze labels. We present VOGeo-Gaze, a real-time, calibration-free, geometry-aware neural network that estimates gaze by reconstructing anatomically meaningful eyeball parameters from image features. The method combines segmentation-driven projection geometry, a refraction-aware pupil correction module, and temporal anatomical stabilization, so gaze is derived from interpretable eye geometry rather than direct angular regression. Trained only on the public TEyeD dataset with weak gaze supervision, VOGeo-Gaze was evaluated on 116 clinical recordings from 17 patients and 19 healthy subjects using EyeSeeCam, a clinical gold-standard VOG system. It achieved median absolute angular errors of 0.33° horizontally and 0.35° vertically, with nearly 92% of recordings below 1° error while operating at >300 FPS. These results demonstrate sub-degree clinical gaze estimation without subject-specific calibration, camera intrinsics, or accurate gaze labels, providing a scalable and interpretable alternative to conventional VOG pipelines. Code is available at https://github.com/DSGZ-MotionLab/VOGeo-Gaze.

16
An atlas of the human metabolome

Chan, J. K.; Ly, N. S.; Taverniti, O.; Gwynne, W. D.; Lieng, B. Y.; Affe, V.; Urquhart-Cox, V. T.; Alonzi, S. M.; Muhundan, M.; Denhart, A. J.; Edgar, L. J.; Quaile, A. T.; Montenegro-Burke, J. R.

2026-05-22 systems biology 10.64898/2026.05.21.726638 medRxiv
Top 0.2%
8.5%

Despite the emergence of cellular atlases like the Human Protein Atlas, no equivalent atlas exists for the human metabolome. Here, we present the Human Metabolome Atlas (HMA, hma.ccbr.utoronto.ca), a comprehensive map containing metabolomic profiles of 70 human cell lines across 22 tissues. With an ~8-fold increase in coverage compared to other resources, the HMA contains quantitative data for 1,768 metabolites at the highest identification confidence, encompassing over 50 lipid classes and a broad range of metabolic pathways. This constitutes the most extensive human metabolomic atlas available. Leveraging the HMA, we identified specific metabolic regulation within pathways and cell types and characterized metabolic processes like glycosylation and ferroptosis. Lastly, we developed a publicly available, interactive web portal to facilitate custom data analysis for the broader scientific community.

17
An improved generic schema for high fidelity data linkage and sample tracing across complex multi-assay medical entomology studies

Kavishe, D. R.; Msoffe, R. V.; Mmbaga, S.; Tarimo, L. J.; Butler, F.; Kaindoa, E. W.; Govella, N. J.; Kiware, S. S.; Killeen, G.

2026-05-13 bioinformatics 10.64898/2026.05.11.724183 medRxiv
Top 0.2%
8.5%

Evidence-based decision making on malaria vector control strategies increasingly relies on triangulation of data, which requires informatics systems that can integrate data from complex, multi-stage studies involving mosquitoes. This manuscript describes a performance evaluation of an extended version of the generic schema underpinning the VBDs360 platform, specifically improved to accommodate multiple distinct entomological assays spanning the field, insectary and laboratory. The utility of this extension, with respect to high-fidelity data linkage and robust sample traceability across complex entomological workflows, was evaluated through a case study conducted in southern Tanzania. Wild female mosquitoes were collected from 40 locations across a >4,000 km² area and then reared through multiple generations in an insectary before derived iso-female lineages were tested for phenotypic susceptibility to a pyrethroid insecticide. Such multi-generational lineages (F to F where n ≥ 2) were propagated to prevent non-heritable maternal effects on phenotype and produce enough progeny for standard WHO susceptibility assays. All samples were subsequently archived in a molecular laboratory, where all F specimens were tested for sibling species identity. A paper-based implementation of the extended schema enabled successful integration of 77,017 lines of data distributed across 6 different tables that spanned 3 distinct field, insectary, and laboratory workflows, implemented by three different teams working in different locations. At each step, fully independent and redundant primary and secondary keys enabled high-fidelity error correction and sample tracing. Consistently perfect linkage between assay design and sample sorting data was achieved for F0 wild-caught adults, with 100% of 66,108 records successfully linked between field capture and morphological categorization. 
This complete traceability extended to the propagation of derived Fn lineages, with all 100 and 243 records from 9 adult-derived and 13 larval-derived lineages, respectively, correctly linked. Insecticide susceptibility phenotype further confirmed 100% linkage for 5,654 records between exposure history and recorded mortality outcome data in the insectary. Although such cross-cleaned linkages to sample analysis and storage data recorded by the laboratory team were not entirely perfect and could be improved, they were nevertheless of very high fidelity (97.3% (1,967/2,022) for F0 samples and 99.3% (437/440) for Fn samples). Overall, this pilot application of the extended generic schema ensured robust data provenance and minimized transcription errors in this complex study distributed across multiple teams and locations. These findings demonstrate how this generic informatics framework may be scaled and adapted to support data integrity across diverse, large-scale, multi-team entomological research workflows.
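
The linkage percentages quoted in this abstract can be illustrated with a small key-based join. The following is a minimal sketch in Python, not the VBDs360 implementation: the field names `primary_id` and `secondary_id`, the function names, and the toy records are all hypothetical, since the abstract does not describe the actual schema. It links each child record to a parent record by primary key and falls back to a secondary key when the primary key fails to match.

```python
def link_records(children, parents, primary="primary_id", secondary="secondary_id"):
    """Split child records into (linked, unlinked) against parent records.

    A child is linked if its primary key matches a parent; otherwise the
    redundant secondary key is tried. Key names are hypothetical.
    """
    by_primary = {p[primary]: p for p in parents if primary in p}
    by_secondary = {p[secondary]: p for p in parents if secondary in p}
    linked, unlinked = [], []
    for child in children:
        parent = by_primary.get(child.get(primary))
        if parent is None:
            parent = by_secondary.get(child.get(secondary))
        if parent is None:
            unlinked.append(child)
        else:
            linked.append(child)
    return linked, unlinked


def linkage_percent(n_linked, n_total):
    """Linkage rate as a percentage, rounded to one decimal place."""
    return round(100.0 * n_linked / n_total, 1)


# Toy data (invented): three field records, three laboratory records.
field = [
    {"primary_id": "A1", "secondary_id": "x-01"},
    {"primary_id": "A2", "secondary_id": "x-02"},
    {"primary_id": "A3", "secondary_id": "x-03"},
]
lab = [
    {"primary_id": "A1", "secondary_id": "x-01"},   # links by primary key
    {"primary_id": "A9", "secondary_id": "x-02"},   # mistyped primary, rescued by secondary
    {"primary_id": "B7", "secondary_id": "y-99"},   # cannot be linked
]
linked, unlinked = link_records(lab, field)
```

Applied to the counts reported in the abstract, `linkage_percent(1967, 2022)` and `linkage_percent(437, 440)` give the quoted 97.3% and 99.3%.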

18
An extensible laboratory information management system for data harmonization across research centers: The ICTS-Dashboard

King, C. H.; De Dios, I.; Barrick, R.; Berger, S.; Almalvez, M.; Auriga, L.; Delot, E. C.; Xiao, C.; LoTempio, J.; Vilain, E.

2026-06-02 health informatics 10.64898/2026.05.31.26354439 medRxiv
Top 0.2%
7.4%

Background: Collaborative research programs increasingly require infrastructure capable of integrating heterogeneous participant, sample, and experimental data while meeting evolving research needs. Existing tools, including clinical EHRs, REDCap, generic research information management systems, and bespoke database builds, were not designed to operationalize project-specific data models. The ICTS-Dashboard, from the Institute for Clinical and Translational Science (ICTS) at the University of California, Irvine (UCI), fills this need by providing a general-purpose research information management system. Methods: We describe the ICTS-Dashboard, built as an open-source, schema-driven platform in which database structure, server-side validation, representational state transfer application programming interfaces (REST APIs), web-based forms, and reproducible exports are all generated from a single versioned JavaScript Object Notation (JSON) Schema set. The backend is implemented in Django, Django REST Framework, and PostgreSQL; the frontend in React. We instantiate the platform with the Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Data Model and extend it with two case studies: a locally developed biobank table for biospecimen logistics, and an embedded adaptation of the RAG-HPO retrieval-augmented phenotype curation tool. Results: The ICTS-Dashboard deployed at the UCI-GREGoR site supports 37 schema-derived tables and 250 documented API endpoints. It holds metadata for 2,558 participants, 1,237 families, 5,517 biobank entries, 2,466 sequenced biospecimens, and 289 genetic findings, and supports quarterly external data submissions regenerated directly from the database. The biobank extension adds entities the consortium does not standardize while preserving foreign-key linkage to rare disease records; the RAG-HPO module adds curator-mediated phenotype normalization against 19,389 indexed HPO terms.
Both were integrated without modifying the GREGoR data model. Conclusion: A version-controlled, machine-readable data model can serve not only as a data sharing standard but as the operational backbone of a research program when paired with schema-governed tooling. The Dashboard's architecture is not intrinsic to a data model or to rare disease; any collaborative research program with a structured, versioned model can adopt the same pattern to reduce implementation overhead and improve reproducibility, harmonization, and findable, accessible, interoperable, and reusable (FAIR)-aligned accessibility.
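
The schema-driven pattern described in this abstract, where one machine-readable definition drives both the database structure and the validation, can be sketched in a few lines. This is an illustrative sketch in plain Python under stated assumptions, not the ICTS-Dashboard code (which the abstract says is built on Django and PostgreSQL): the `SCHEMA` contents, the field names, and the helper functions are hypothetical, and only a small JSON Schema subset (`properties`, `required`, and the `string` and `integer` types) is handled.

```python
# Hypothetical table definition in a JSON Schema-like form.
# It is NOT taken from the GREGoR data model.
SCHEMA = {
    "title": "participant",
    "type": "object",
    "properties": {
        "participant_id": {"type": "string"},
        "family_id": {"type": "string"},
        "age_years": {"type": "integer"},
    },
    "required": ["participant_id", "family_id"],
}

PY_TYPES = {"string": str, "integer": int}
SQL_TYPES = {"string": "TEXT", "integer": "INTEGER"}


def create_table_sql(schema):
    """Derive a CREATE TABLE statement from the schema."""
    required = set(schema.get("required", []))
    columns = []
    for name, spec in schema["properties"].items():
        not_null = " NOT NULL" if name in required else ""
        columns.append(f"{name} {SQL_TYPES[spec['type']]}{not_null}")
    return f"CREATE TABLE {schema['title']} ({', '.join(columns)});"


def validate(record, schema):
    """Return a list of error messages; an empty list means the record is valid."""
    errors = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required field: {name}")
    for name, value in record.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")
        elif not isinstance(value, PY_TYPES[spec["type"]]):
            errors.append(f"wrong type for field: {name}")
    return errors


ddl = create_table_sql(SCHEMA)
ok_errors = validate({"participant_id": "P001", "family_id": "F01", "age_years": 7}, SCHEMA)
bad_errors = validate({"participant_id": "P002", "age_years": "seven"}, SCHEMA)
```

Because the table definition and the validator both read the same `SCHEMA` object, adding a field in one place changes both, which is the property the abstract attributes to schema-governed tooling.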

19
Highly contiguous reference genome assembly of the endangered Orces blue whiptail Holcosus orcesi

Pozo, G.; Cisneros-Heredia, D. F.; Barragan-Orbe, D.; Sanchez-Nivicela, J. C.; Arbelaez, E.; Torres, M.

2026-05-16 genomics 10.64898/2026.05.14.725226 medRxiv
Top 0.2%
6.7%

Holcosus orcesi, the Orces Blue Whiptail, is a Critically Endangered lizard endemic to the upper Jubones River basin in southern Ecuador. Restricted to a narrow elevational range within semi-arid Andean shrublands, it represents one of the few montane members of a predominantly lowland lineage. Here we present the first high-quality reference genome for H. orcesi, generated using Oxford Nanopore Technologies long-read sequencing. The assembly spans 1.68 Gb across only 91 contigs, with an N50 of 76.2 Mb and a BUSCO completeness of 96.8%, making it among the most contiguous and complete squamate genomes to date. Structural annotation predicted 25,682 genes, of which 85% showed homology to known proteins and 45% were assigned Gene Ontology terms. Repetitive elements accounted for 46.3% of the genome, with LINEs representing the predominant class. This genome provides a foundational resource for future evolutionary, comparative and conservation-genomic research on H. orcesi and other mountain reptiles, enabling studies of population genomics, local adaptation, and genomic erosion in isolated populations. By expanding the genomic representation of tropical montane reptiles, this work helps address longstanding phylogenetic and geographic gaps in global biodiversity genomics and provides a foundation for evidence-based conservation of H. orcesi and related taxa.
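
For readers unfamiliar with the contiguity metric quoted in this abstract: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal sketch in Python follows; the function name and the contig lengths are made up for illustration and are not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must contain at least one positive value")


# Invented lengths summing to 54: the three longest contigs (10 + 9 + 8 = 27)
# reach half of the total, so the N50 is 8.
example = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

A higher N50 at a similar total length means the assembly is split into fewer, longer pieces.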

20
Chromosome-level genome assembly and annotation of the threatened marbled teal (Marmaronetta angustirostris)

Ortego, J.; Lopez-Luque, R.; Backstrom, N.; Green, A. J.

2026-05-14 genomics 10.64898/2026.05.12.723956 medRxiv
Top 0.2%
6.7%

The marbled teal (Marmaronetta angustirostris) is a widely distributed but declining waterfowl species, classified as Near Threatened globally and Critically Endangered in Spain. Despite ongoing conservation actions, including ex situ management and population reinforcement programmes, the genomic consequences of long-term captivity, inbreeding, and patterns of functional genetic variation remain unknown due to the absence of a species-specific reference genome. Here, we present the first chromosome-level genome assembly for this species. The genome was generated using PacBio HiFi long reads and Omni-C data, yielding a 1.15 Gb assembly with a scaffold N50 of 76.95 Mb. A total of 97.16% of the assembly was anchored into 36 chromosome-scale scaffolds, including the Z and W sex chromosomes. BUSCO analysis recovered 99.2% of conserved avian genes. Gene prediction was performed using both ab initio and homology-based strategies, resulting in 16,048 protein-coding genes. This resource provides a foundation for genome-wide analyses of inbreeding, demographic history, and adaptive variation, and will support evidence-based in situ and ex situ conservation strategies for this threatened species.